[SPARK-3094] [PySpark] compatible with PyPy #2144
Conversation
QA tests have started for PR 2144 at commit

Tests timed out after a configured wait of

Jenkins, retest this please.

QA tests have started for PR 2144 at commit

Tests timed out after a configured wait of
This looks like it will be tricky to maintain without automated testing. Can you update dev/run-tests to also run the PySpark tests with PyPy, maybe? You might need help from Patrick or others on installing PyPy on the Jenkins machines.

QA tests have started for PR 2144 at commit

Tests timed out after a configured wait of

QA tests have started for PR 2144 at commit

QA tests have finished for PR 2144 at commit

QA tests have started for PR 2144 at commit

QA tests have finished for PR 2144 at commit
@davies just curious, do all the unit tests run if you do

Yes, I will do that next week.

One concern with adding tests for

How long do the Python tests run now? Anyway, we could run PyPy only if Python code changed (but I'd still run CPython all the time).

PyPy is fully compatible with CPython for pure Python code, so it's not necessary to test every commit with PyPy. Maybe we could have nightly tests (for performance or scalability), and we could put PyPy in that kind of test.

PyPy does not fully support NumPy right now, so MLlib cannot run with PyPy.
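A common way to keep a suite from failing outright in that situation is to gate NumPy-dependent tests on whether the import succeeds. A minimal sketch (the `HAVE_NUMPY` flag name is illustrative, not taken from Spark's test code):

```python
# Detect NumPy availability once, so NumPy-dependent tests (e.g. MLlib's)
# can be skipped on interpreters without NumPy support, such as a PyPy
# build where NumPy does not work.
try:
    import numpy  # noqa: F401
    HAVE_NUMPY = True
except ImportError:
    HAVE_NUMPY = False

print("NumPy available:", HAVE_NUMPY)
```

The flag can then be handed to a skip decorator or a test-runner filter.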
So you guys should figure out a way to run this so that it doesn't get stale. For example, it's fine to add some code to the script that runs all the tests except the MLlib ones. But there's little point merging it unless we also automatically test that it keeps working; otherwise we'll only notice breakage at each release (if we remember to test with PyPy).

+1

Let's just have the PyPy tests run by default on Jenkins. If this causes build speed problems later down the road, we can revisit the issue of selectively running tests.
@mateiz @JoshRosen @mattf run-tests will try to run the tests for Spark core and SQL with PyPy. One known issue is that serialization of array in PyPy behaves like Python 2.6, which is not supported by Pyrolite, so one test case has been skipped for it. I added another one that does not depend on serialization of array. I also refactored cloudpickle to do it in a more portable way (one that is also used by dill).
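One way to express such an interpreter-specific skip is with `unittest.skipIf` keyed on the running implementation. This is a sketch, not the PR's actual test code; the class, method, and skip message are illustrative:

```python
import array
import platform
import unittest

# True when running under PyPy rather than CPython.
IS_PYPY = platform.python_implementation() == "PyPy"

class ArraySerializationTests(unittest.TestCase):
    @unittest.skipIf(
        IS_PYPY,
        "array serialization under PyPy matches Python 2.6, "
        "which the JVM-side unpickler does not support",
    )
    def test_array_roundtrip(self):
        a = array.array("d", [1.0, 2.0, 3.0])
        self.assertEqual(list(a), [1.0, 2.0, 3.0])
```

On CPython the test runs normally; on PyPy it is reported as skipped instead of failing.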
Jenkins, test this please.

QA tests have started for PR 2144 at commit

QA tests have finished for PR 2144 at commit

I'm waiting to figure out the right procedure for installing

Thanks to @shaneknapp we now have

QA tests have started for PR 2144 at commit

QA tests have finished for PR 2144 at commit

QA tests have started for PR 2144 at commit

QA tests have started for PR 2144 at commit

QA tests have finished for PR 2144 at commit

QA tests have finished for PR 2144 at commit
This looks good to me (Davies and I walked through the code offline). I'm going to merge this into
function.func_code.co_names has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals if there is a global with the same name as an attribute (in co_names). This is a regression introduced by #2144; this reverts part of the changes in that PR. cc JoshRosen

Author: Davies Liu <[email protected]>

Closes #2522 from davies/globals and squashes the following commits:

dfbccf5 [Davies Liu] fix bug while pickle globals of function
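The over-approximation is easy to see in plain Python. In this minimal sketch (the names `value` and `f` are illustrative, not from the PR), a pickler that treats every entry in `co_names` as a global reference would capture a global the function never uses:

```python
# co_names lists attribute names as well as global references, so a naive
# pickler walking co_names will over-capture globals.

value = 42  # module-level global that f never actually uses

def f(obj):
    # 'value' here is an attribute access on obj, not a use of the global
    # 'value', yet the name still appears in f.__code__.co_names.
    return obj.value

print(f.__code__.co_names)  # ('value',)
```

Note that the parameter `obj` is a local and therefore lives in `co_varnames`, not `co_names`, so only the ambiguous attribute name shows up.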
Sorry for the stupid question. Does this work on executors in a YARN cluster? Or is it just for local mode?
After this patch, we can run PySpark on PyPy (tested with PyPy 2.3.1 on Mac OS X 10.9), for example:
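The exact command from the original post is not preserved in this capture; the following shows the standard way to point PySpark at an alternative interpreter via the `PYSPARK_PYTHON` environment variable, assuming `pypy` is on `PATH` (the `my_job.py` filename is a hypothetical placeholder):

```shell
# Run the interactive PySpark shell under PyPy instead of CPython.
PYSPARK_PYTHON=pypy ./bin/pyspark

# Or submit a batch job with PyPy as the worker interpreter.
PYSPARK_PYTHON=pypy ./bin/spark-submit my_job.py
```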
The performance speedup will depend on the workload (from 20% to 3000%). Here are some benchmarks:
Here is the code used for the benchmark:
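The original benchmark code did not survive this capture. As a stand-in, here is a minimal pure-Python timing harness of the CPU-bound style of workload where PyPy's tracing JIT typically gives the largest wins over CPython; all names are illustrative and not from the PR:

```python
import time

def cpu_bound(n):
    # Tight pure-Python loop: interpreter-bound work of the kind that
    # PyPy's JIT accelerates most relative to CPython.
    total = 0
    for i in range(n):
        total += i * i
    return total

start = time.time()
result = cpu_bound(1_000_000)
elapsed = time.time() - start
print(result, "computed in", round(elapsed, 3), "s")
```

Running the same script under `python` and `pypy` and comparing the elapsed times gives a rough sense of the per-workload speedup.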